I like to start off talks about reproducibility in science with some humor. This video is a few years old, but it has some timeless insights.
What’s the point? That even the most well-meaning of us can make careless errors that undermine the reproducibility of science.
But, is it a crisis?
In 2016, Nature published the results of a survey of 1,500 scientists (Baker 2016). They were asked a number of questions, including the following:
Can someone pick up where you left off without significant loss of time and momentum?
Think of each step of your data workflow:
# Import/gather data
# Clean data
# Visualize data
# Analyze data
# Report findings
Imagine writing R code to handle each step.
# Import data
my_data <- read.csv("path/2/data_file.csv")
# Clean data
my_data$gender <- tolower(my_data$gender) # make lower case
...
You could put all the code in one script file, or even better, have separate scripts for each step that you source() one by one.
# Import data
source("R/Import_data.R") # source() runs scripts, loads functions
# Clean data
source("R/Clean_data.R")
# Visualize data
source("R/Visualize_data.R")
...
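The one-by-one source() calls above can also be wrapped in a small helper. This is only a sketch: `run_steps()` is a hypothetical name, and the script paths are the ones from the example above, assumed to exist relative to the project directory.

```r
# Step scripts in the order they should run (paths from the example above).
steps <- c("R/Import_data.R", "R/Clean_data.R", "R/Visualize_data.R")

# run_steps() is a hypothetical helper, not part of base R or RStudio.
run_steps <- function(scripts) {
  # Fail early, before running anything, if a script is missing.
  missing_scripts <- scripts[!file.exists(scripts)]
  if (length(missing_scripts) > 0) {
    stop("Missing script(s): ", paste(missing_scripts, collapse = ", "))
  }
  for (script in scripts) {
    message("Running ", script)
    source(script)  # objects are created in the global environment
  }
  invisible(scripts)
}

# Usage, once the scripts exist:
# run_steps(steps)
```

Checking for missing files first means a typo in a late step is caught before the earlier, possibly slow, steps have run.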
RStudio has a ‘projects’ feature that I strongly recommend you use.
Here’s an example of some recent projects I’ve worked on.
Using RStudio projects helps keep your files and settings organized, and it’s easy to switch between projects. Projects reduce mental effort (what directory am I in?) and, in particular, spare you directory-setting commands like setwd() that only work on your own computer. RStudio projects also integrate with version control (e.g., Git and GitHub).
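As a rough illustration of why this matters: when a project is open, the working directory is the project folder, so paths can be written relative to it. The folder name `data` and the file name below are placeholders, not files from this bootcamp.

```r
# Build a path relative to the project root instead of calling setwd().
# "data" and "data_file.csv" are placeholder names for illustration.
data_path <- file.path("data", "data_file.csv")

if (file.exists(data_path)) {
  my_data <- read.csv(data_path)
} else {
  # Report where R looked, which helps when the wrong project is open.
  message("No file at ", data_path, " (relative to ", getwd(), ")")
}
```

Because nothing in the path refers to a specific user's home directory, the same script should run unchanged on a collaborator's machine.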
# Big idea
## Smaller idea in service of bigger
- Supporting point
- Another supporting point
1. an enumerated **bold** point
1. an enumerated *italicized* point
- a [link](http://psu-psychology.github.io/r-bootcamp) to this bootcamp
- an image: 
- an equation: $e = mc^2$
# Some R code
ggplot2::qplot(rnorm(100))
# Some R code
ggplot2::qplot(rnorm(100))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
- [pdf_document](http://rmarkdown.rstudio.com/pdf_document_format.html), [word_document](http://rmarkdown.rstudio.com/word_document_format.html), or [github_document](http://rmarkdown.rstudio.com/github_document_format.html)
- [ioslides_presentation](http://rmarkdown.rstudio.com/ioslides_presentation_format.html) for HTML slide show
- Cool interactive web-apps in Shiny
- Web sites like the one for this [bootcamp](https://github.com/psu-psychology/r-bootcamp-2018), [blogs](https://bookdown.org/yihui/blogdown/), even [books](https://bookdown.org/yihui/bookdown/)
I’ve analyzed the survey data you provided using an R Markdown document bootcamp-survey.Rmd. Let’s open it up and see how it looks.
The default format is an html_document but I can easily produce outputs in different formats simply by changing a parameter.
rmarkdown::render('talks/bootcamp-survey.Rmd', output_format = "pdf_document")
rmarkdown::render('talks/bootcamp-survey.Rmd', output_format = "word_document")
rmarkdown::render('talks/bootcamp-survey.Rmd', output_format = "ioslides_presentation")
rmarkdown::render('talks/bootcamp-survey.Rmd', output_format = c("pdf_document", "word_document", "github_document", "ioslides_presentation"))
So, I can prepare one document and produce many different output formats. Your adviser likes PDF? No problem. Your collaborator prefers MS Word? Got it covered. Need to give a quick brown bag talk from any web browser? Easy.
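If one format fails (pdf_document, for example, needs a LaTeX installation), it can be handy to render the formats one at a time and see which ones succeeded. The helper below is a sketch only: `render_formats()` and the `rendered` output folder are made-up names, while `rmarkdown::render()` and its `output_format`, `output_dir`, and `quiet` arguments are the real ones.

```r
# render_formats() is a hypothetical helper; "rendered" is a placeholder folder.
render_formats <- function(input, formats, output_dir = "rendered") {
  stopifnot(file.exists(input))
  # Render each format separately so one failure does not stop the others.
  vapply(formats, function(fmt) {
    tryCatch(
      as.character(
        rmarkdown::render(input, output_format = fmt,
                          output_dir = output_dir, quiet = TRUE)
      ),
      error = function(e) NA_character_  # NA marks a format that failed
    )
  }, character(1))
}

# Usage, assuming the survey document exists at this path:
# render_formats("talks/bootcamp-survey.Rmd",
#                c("pdf_document", "word_document", "ioslides_presentation"))
```

The result is a named character vector: the path of each output file, or NA for any format that could not be rendered.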
papaja
The following section is copied verbatim from Mike Frank & Chris Hartgerink’s tutorial on GitHub.
There are three reasons to write reproducible papers. To be right, to be reproducible, and to be efficient. There are more, but these are convincing to us. In more depth:
To avoid errors. Using an automated method for scraping APA-formatted stats out of PDFs, (Nuijten et al. 2015) found that over 10% of p-values in published papers were inconsistent with the reported details of the statistical test, and 1.6% were what they called “grossly” inconsistent, e.g. difference between the p-value and the test statistic meant that one implied statistical significance and the other did not. Nearly half of all papers had errors in them.
To promote computational reproducibility. Computational reproducibility means that other people can take your data and get the same numbers that are in your paper. Even if you don’t have errors, it can still be very hard to recover the numbers from published papers because of ambiguities in analysis. Creating a document that literally specifies where all the numbers come from in terms of code that operates over the data removes all this ambiguity.
To create spiffy documents that can be revised easily. This is actually a really big neglected one for us. At least one of us used to tweak tables and figures by hand constantly, leading to a major incentive never to rerun analyses because it would mean re-pasting and re-illustratoring all the numbers and figures in a paper. That’s a bad thing! It means you have an incentive to be lazy and to avoid redoing your stuff. And you waste tons of time when you do. In contrast, with a reproducible document, you can just rerun with a tweak to the code. You can even specify what you want the figures and tables to look like before you’re done with all the data collection (e.g., for purposes of preregistration or a registered report).
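To make the idea of numbers that come straight from code a little more concrete, here is a minimal base R sketch. The data are simulated and the variable names are invented for illustration; this is not the papaja workflow itself, just the underlying idea of building the reported statistics from the fitted object rather than typing them by hand.

```r
set.seed(2018)  # fixed seed so the simulated data are the same on every run

# Simulated scores for two hypothetical groups.
group_a <- rnorm(30, mean = 0)
group_b <- rnorm(30, mean = 0.5)

fit <- t.test(group_a, group_b)  # Welch two-sample t-test

# Assemble the reported statistics directly from the test object.
report <- sprintf(
  "t(%.1f) = %.2f, p = %.3f",
  fit$parameter, fit$statistic, fit$p.value
)
report
```

In an R Markdown document, a string like `report` can be dropped into the text with an inline R expression, so the reported values change automatically whenever the data or the analysis change.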
papaja package
It’s possible to write a paper like this from an R Markdown document that looks like this. Let’s peek under the hood just a bit.
So, there’s much more to say about how to do this than we have time for today. This guide or this are good places to start. But I think we can all agree that pushing a button to render a complete paper, including tables, figures, and references, is pretty amazing.
Track changes is great, right? But if you’ve ever written a lengthy document with other people, you’ve experienced the challenge of tracking versions across time. At some point, the changes become too extensive to track, so the authors accept or reject a bunch of them and create a new version. This is how version control becomes an extension of the track changes problem. Most of us have experienced something like this sequence: ‘paper.docx’, ‘paper_new.docx’, ‘paper_new_new.docx’, ‘paper_new_new_ROG.docx’, etc.
My current scheme with colleagues is something like this: ‘nsf_grant_2018-08-16v1.docx’, ‘nsf_grant_2018-08-16v2.docx’, etc. That is, each person who modifies the document saves it as a new version. It doesn’t avoid conflicts if we’re working in parallel, but it does help us track down where we went astray.
Imagine a scheme for doing this automatically with your R and RStudio files. RStudio integrates with two version control systems from the software development world, ‘git’ and ‘subversion’. I use ‘git’ along with GitHub, a web-based service for managing projects that use git.
We don’t have time to go into git and GitHub here, but I strongly recommend Jenny Bryan’s tutorial Happy Git and GitHub for the useR. In the meantime, this is the workflow I use for almost every project I do that will involve R:
File/New Project.../File/Open Project...
These videos show this workflow in action.